RNAseq differential expression analysis of TEs in Drosophila melanogaster (DmGoth lineage)

Alignment to TE Consensus sequences

TE consensus sequences are contained in D_mel_transposon_sequence_set.fa (downloaded from https://github.com/bergmanlab/transposons/ ) By aligning reads to these sequences, we can identify which TEs are expressed and get countings.

Commands :

~/softs/minimap2/minimap2 -ax splice ~/cristina_TE/data/consensus/D_mel_transposon_sequence_set.fa ~/cristina_TE/data/reads/FC29.fastq.gz | samtools sort -o FC29.against_consensus.bam

~/softs/minimap2/minimap2 -ax splice ~/cristina_TE/data/consensus/D_mel_transposon_sequence_set.fa ~/cristina_TE/data/reads/FC30.fastq.gz | samtools sort -o FC30.against_consensus.bam

We get two bam files as a result (FC29 : 5.2Go, FC30 : 1.8Go )

Some informations on our reads...

Here's an example of what's in D_mel_transposon_sequence_set.fa :

>1731#LTR/Copia
TGTTGAATATAGGCAATGCCCACATGTGTGTTGAATATAGGCAATTTCCACATGTGCATA
TGTAATTTTGTATGAGAACATACATACATACACATGAACTGTATGTATGTATATATATTA

And here's the beginning of the BAM we get by aligning our reads to the TEs consensus sequences :

e1bae963-07b0-44fa-8a67-ae90f86c4ba0    2064    Tc3#DNA/Tc1-Mariner     331     38      710H32M5D19M2D8M1D5M1D18M1D4M1D124M1384H        *       0
       0       GGTTTTACGCTTCTAACTTGACTTCTTGTTTGTTAAATCTCGAAAGTTAAATTCTTTTGATTCTAAATATAAATTATCTTTTTAATTTTTTCTCAAATGGTCCGCGAAAAGTCTTTATCCGATTTTGAAAAAGGTCAAATCAAAGGCTATATTGAATCTGGTTTAAAACACTGTGTAATAGCCAAGAAAATCGGTTGAAGTCAAAACGTT      53/a:/-$7)4/($,%"/$#''-2#"$-5'"''$#'&$%'$"+2%4)*+/)(:?-L/$)2(('#/7+(,$+.#&%(%#&'0-%+(*16995,0.*?;3;9;+*+*/(--)&"%()%8<89210OG?*)NPMPKGC)6>:>*/H5?;-%3/2B4.DLL./A8<:*'A>-"1%#)0255A<*5H2.N(JQ@=J+?@A+-6L7K4)KT1=@WR      NM:i:35 ms:i:115        AS:i:115        nn:i:0  tp:A:P  cm:i:9  s1:i:64 s2:i:0  de:f:0.1389     SA:Z:McClintock#LTR/Gypsy,592,+,1229S153M2I920S,60,11;  rl:i:0

We want to count how many reads are aligned on each TEs sequences BUT, we need to remind ourselves that a read can be aligned to several features, then, which one should we count ?

Let's get a look at the alignQC report for FC30.bam

alignqc analyze FC30.sorted.bam -g D_mel_transposon_sequence_set.fa -o FC30.consensus.report/FC30.consensus.alignqc.xhtml

Well, that didn't work. Seems like a Windows return carriage error ? Maybe this could help :

tr -d '\r' < D_mel_transposon_sequence_set.fa > D_mel_transposon_sequence_set.fixed.fa

NGS crashed in the meantime, and it did not help : we still got the same error message :

  File "/data/home/ecumunel/.local/lib/python2.7/site-packages/seqtools/errors.py", line 813, in get_context_target_errors
    r[t][tafter]['-']['total'] += 0.5
KeyError: 'Y'

Another thing to take into account : They are primary, secondary, supplementary and unmapped reads. Which one should we count and not count ? I guess we should discard supplementary alignments and unmapped reads. Let's check manually how many primary and secondary reads we got.

Example case of a supplementary read : e1bae963-07b0-44fa-8a67-ae90f86c4ba0

This read generate two alignments : one primary (in McClintock#LTR/Gypsy) and one supplementary (in Tc3#DNA/Tc1-Mariner). The alignment score is better in the primary. In both case, the alignment covers a very small part of the consensus sequence...

So we've got around 3 millions unmapped reads, 400 supplementary, 1000 secondary and 7369 primary alignments. This seems very low, but considering we only aligned against TE consensus sequences, which is like 2% of the transcriptome, it's actually not so bad. And FC29 is the smallest dataset.

A more helpful number should be the number of single mapped read (read that mapped to only one feature)
TODO : fix the following script.

Generating the dataframe : Counts per Superfamily, family, and TE

Getting mean subject coverage...

Exporting dataframe to csv...

Importing dataframe from CSV

Barplot of TE expression ratio (from consensus data)

Pie plots : Ratio of each superfamily / family / TE

Some trouble with the query coverage of some TEs...

The query coverage of some TE can be very low, as seen here on IGV for 1731#LTR/Copia : title

When we blatted the sequence of read e687c074-c5e2-4c85-9ba0-492e6a4a8a84 (in black on previous picture), this is what we got :

title

Our read is divided in at least four part :

One part of our read (in pink from 2385 to 4145) is aligned a bunch of time against the duplication 1731 of Copia, in several chromosomes (which is normal for a TE). Another one (in orange from 4 to 1464) is aligned three time to genes coding for a very similar protein (FBtr0081639 = alphaTub84B, which is a gene from chr3R:7086599-7088839, FBtr0081538 =alphaTub84D, another gene from chr3R:7530753-7532732 and FBtr0082087 =alphaTub85E, from chr3R:9731221-9733250).

This read seems to be chimeric : it looks like several reads glued together. To check if it is the case, we could have a look at the different sequence between each part. Maybe it's a redundant sequence that we could find in other reads, in which case it would suggest that it is an adaptator or something... We should align this in-between sequence and check if we can find it elsewhere.

Only pink2purple has a blat result. Let's compare those sequences to another read aligned to copia dup 1731 (7596c024-a22d-48a3-b6e4-ec4d75980c0a)

The new read can be divided in 3 parts : Copia, gene1 and gene2. title

Countings by position from genomic alignments